Evidence from comprehensive independent validation studies for smooth pursuit dysfunction as a sensorimotor biomarker for psychosis

Smooth pursuit eye movements are considered a well-established and quantifiable biomarker of sensorimotor function in psychosis research. Identifying psychotic syndromes on an individual level based on neurobiological markers is limited by heterogeneity and requires comprehensive external validation to avoid overestimation of prediction models. Here, we studied quantifiable sensorimotor measures derived from smooth pursuit eye movements in a large sample of psychosis probands (N = 674) and healthy controls (N = 305) using multivariate pattern analysis. Balanced accuracies of 64% for the prediction of psychosis status are in line with recent results from other large heterogeneous psychiatric samples. They are confirmed by external validation in independent large samples including probands with (1) psychosis (N = 727) versus healthy controls (N = 292), (2) psychotic (N = 49) and non-psychotic bipolar disorder (N = 36), and (3) non-psychotic affective disorders (N = 119) and psychosis (N = 51), yielding accuracies of 65%, 66% and 58%, respectively, albeit in slightly different psychosis syndromes. Our findings make a significant contribution to the identification of biologically defined profiles of heterogeneous psychosis syndromes on an individual level, underlining the impact of sensorimotor dysfunction in psychosis.


Machine training and internal validation: B-SNIP1
The model distinguished psychosis probands from healthy controls by SPEM variables with a mean balanced accuracy of 63.96% (p < 0.001, Table 3; for further result parameters, refer to Supplementary Table 4). On average, 53% of the psychosis probands and 75% of the control subjects were correctly classified (sensitivity = 52.97%, specificity = 74.96%, Table 3). Mean likelihood ratios 42 were 2.18 for a positive test result and 0.63 for a negative test result.
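The reported metrics are linked by standard formulas, which can be sketched as follows. This is a minimal illustration using the mean sensitivity and specificity quoted above; the function names are hypothetical. Ratios computed from mean sensitivity and specificity need not match the mean of fold-wise likelihood ratios exactly, so a small difference from the reported 2.18 is to be expected.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity (robust to unequal group sizes)."""
    return (sensitivity + specificity) / 2


def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as fractions."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Mean values reported for the internal validation in B-SNIP1.
sens, spec = 0.5297, 0.7496

bacc = balanced_accuracy(sens, spec)
lr_pos, lr_neg = likelihood_ratios(sens, spec)
print(f"balanced accuracy = {bacc:.4f}, LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

The same two functions apply unchanged to the sensitivity and specificity values of the external validation samples.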

External validation-1: B-SNIP2
Validation in the B-SNIP2 sample included n = 666 psychosis probands and n = 289 healthy controls (n = 64 participants could not be entered into the machine due to at least one missing value). Supporting its validity, the B-SNIP1-derived model discriminated psychosis probands from healthy controls in the independent B-SNIP2 sample with a balanced accuracy of 65.03% (see Table 3 and Supplementary Table 4). About 56% of the psychosis probands and 74% of the control subjects were correctly classified (sensitivity = 56.01%, specificity = 74.05%, Table 3).

External validation-2: PARDIP
For the PARDIP sample, n = 44 bipolar probands with psychosis symptoms, n = 33 bipolar probands without psychosis symptoms and n = 70 healthy controls were included in the validation procedure (n = 9 participants were excluded due to at least one missing value). Our trained model could distinguish bipolar probands with psychosis symptoms from healthy controls with a balanced accuracy of 65.52% (Table 3 and Supplementary Table 4). About 68% of the bipolar probands with psychosis were correctly classified as psychosis probands and 63% of the control subjects were correctly classified as healthy controls (sensitivity = 68.18%, specificity = 62.86%, Table 3). Furthermore, about 61% of the bipolar probands without psychosis symptoms were classified as controls, which means that they are closer to the healthy non-psychotic than to the psychosis category (Table 3).

External validation-3: FOR2107
To validate the machine in predominantly affective psychopathology, data from n = 94 probands with major depression and n = 25 probands with bipolar disorder, both groups without psychotic symptoms, from the FOR2107 consortium were entered into the analyses. Using the B-SNIP1 machine, nearly 81% of the probands with major depression and 60% of the probands with bipolar disorder were classified as being closer to the healthy non-psychotic than to the psychosis category (Table 3).

External validation-4: PRONIA
Validation in high-risk and recent-onset psychotic or depressive disorder could be computed in n = 11 probands with recent-onset psychosis, n = 17 probands with recent-onset depression, n = 19 participants with clinical high risk of psychosis, and n = 16 controls (PRONIA study). Emphasizing the validity of the machine, about 94% of the controls were categorized as healthy. However, in contrast to previous results in chronically ill psychosis probands, only 18% of the recent-onset psychosis probands were classified as psychosis patients (Table 3). Interestingly, the machine labeled nearly 42% of the participants with clinical high risk of psychosis as psychosis probands (Table 3). Of the probands with recent-onset, non-psychotic depression, 76% were classified as healthy controls (Table 3).

ROD recent-onset depression probands, CHR clinical-high-risk-for psychosis probands, ROP recent-onset-psychosis probands, SD standard deviation. a Significant differences between controls and psychosis probands were found for maintenance gain, early gain, initial eye acceleration, and latency (all p > 0.001). b Significant differences between controls and psychosis probands were found for maintenance gain, early gain, initial eye acceleration, and latency (all p > 0.001). c Significant differences were found for maintenance gain (p < 0.001; BPwP < controls), early gain (p < 0.001; BPwP < controls and BPwoP), and initial eye acceleration (p = 0.008; BPwP < controls). d Significant differences were found for maintenance gain (p = 0.009; psychosis probands < controls and MDwoP), early gain (p < 0.001; psychosis probands < controls and MDwoP), and initial eye acceleration (p < 0.001; psychosis probands < MDwoP, BPwoP < MDwoP). e Significant differences were found for early gain (p < 0.001; CHR < controls).

Discussion
In the current study we examined a set of traditional SPEM measures (i.e. predictive eye velocity maintenance gain, early eye velocity maintenance gain, initial eye acceleration, and eye latency; Leigh & Zee 17 ; Lencer et al. 20 ) and their interactions as quantifiable biological indicators of psychosis-related visual sensorimotor dysfunction in large samples of probands with psychotic disorders. This is an important approach since the identified SPEM deteriorations point to specific deficits in the transformation of sensory motion signals into motor action that are associated with alterations in occipito-parieto-frontal networks 24,43 .
To overcome the limitations of classical frequentist statistics, we implemented multivariate pattern analyses (e.g. supervised machine learning approaches) 44 using internal (i.e. a hold-out subsample consisting of participants that were not used for training) and external (i.e. an independent dataset) validation in sufficiently large data samples 11 to allow for clinically relevant single-subject statements pointing to sensorimotor transformation deficits. Most importantly, we not only trained and internally validated the machine-learning algorithm in a single sample but also applied and externally validated the machine in an independent large sample of psychosis probands.
Although a balanced accuracy score of nearly 64% as derived from our training sample (B-SNIP1) may be regarded as insufficient for SPEM performance to be used as a single screening instrument for determining psychosis-related sensorimotor transformation function, it significantly exceeds chance level and remains within the range of expectable results in similarly sized heterogeneous psychiatric samples 11 . Additionally, a likelihood ratio for a positive test result of 2.18 could be interpreted as a small (but important) change in probability 42 . Our second key finding emphasizes the generalization to new data when applying the model to an independent cohort of chronically ill psychosis probands and healthy controls. Regarding the first external validation in the B-SNIP2 sample (external validation-1), our machine yielded a comparable (even slightly higher) balanced accuracy of 65.03% when discriminating the two groups. This result is particularly meaningful due to (a) the independence of both data sets and (b) slight differences in the SPEM task design, underlining the robustness of classification results by our model. A third cohort with chronically ill psychosis probands and healthy controls was derived from the FOR2107 consortium (external validation-3) and could be classified correctly with a balanced accuracy of 57.64%.
Our findings support the original suggestions by Diefendorf and Dodge 45 to use SPEM as a neurobiological diagnostic tool that comes with multiple advantages, including standardized measurements and brief 5-min testing feasible even for severely impaired patients. Here, we applied a constellation of SPEM tasks consisting of full-ramp and foveo-petal step-ramp trials at a constant velocity of 18.7 degrees of visual angle per second. These specific SPEM tasks allow the computation of the four key measures to evaluate SPEM performance and can be recommended for future studies. Our results add to previous findings based on traditional group analyses in indicating that SPEM is a valuable psychosis-related biomarker of sensorimotor integrity that is useful even at the single-subject level 20 . Besides its diagnostic value, this biomarker bears highly relevant information for establishing personalized treatment regimes.
Very recently, St Clair and colleagues 46 applied a multiclass machine-learning model to differentiate patients with schizophrenia, bipolar affective disorder, major depressive disorder, and healthy controls on the basis of 98 eye movement measures (including several SPEM variables). The model was tested in two validation sets, achieving balanced accuracies for schizophrenia patients of 73% and 75%. Both validation sets were relatively small (test-1 internal validation: n = 30 schizophrenia, n = 35 bipolar, n = 33 depression, n = 35 controls; test-2 external validation: n = 60 schizophrenia, n = 184 controls), which entails an increased risk of misclassification 11 . To avoid this common shortcoming, we used a large internal validation sample and applied our machine to several extensive independent data sets. Of note, the task from St Clair and colleagues took about 15 min in total, yielding a total of 98 eye movement measures 47 derived from free viewing, fixation duration, and smooth pursuit tasks 46 , which limits its clinical practicability.
To further determine the model's specificity regarding the relationship between psychotic symptoms and SPEM performance, we applied the machine to other patient groups. In this regard, there has been an extensive discussion concerning similarities and differences between schizophrenia and bipolar disorder 48 . Machine-learning models based on brain data have been used to discriminate both patient groups 49 , though often merging data from bipolar patients with and without a history of psychotic episodes 50 .
Similarly, St Clair and colleagues 46 did not specify psychosis symptoms in those patients suffering from bipolar disorder and major depression, which we found to have a significant impact, as demonstrated by our external validation-2 sample from PARDIP. In line with the idea of a relationship between SPEM deterioration and psychotic psychopathology, our machine classified about 68% of the bipolar probands with psychosis correctly as psychosis patients, while 61% of the bipolar probands without psychosis symptoms were classified as healthy (which means that they are closer to healthy individuals). Underlining its generalizability, 60% of the bipolar probands without psychotic symptoms from the FOR2107 study (external validation-3) were also rated closer to the healthy non-psychotic category.
Broadening the perspective of specificity regarding SPEM deficits in affective disorders, we found that nearly 81% of probands suffering from major depression without psychotic episodes (FOR2107 study, external validation-3) were classified as healthy, indicating closer affiliation to the non-psychotic category. This result is in line with previous findings of only minor SPEM impairments from traditional group statistics 36 and with multivariate pattern analyses based on brain data indicating major depression and schizophrenia as two end points of an interjacent continuum 50 .
Our external-validation sample 4 from the PRONIA study was used to test our model in young probands being at clinical high risk for psychosis or experiencing a first psychotic or first depressive episode. Interestingly, about 42% of probands with clinical high risk of psychosis were categorized as psychosis probands, which might support the idea of an underlying susceptibility of SPEM deficits in the psychosis spectrum 51 . Indeed, the specific SPEM measures of predictive and early maintenance gain indicated the worst performance in this proband group compared to all three other PRONIA groups (see Table 2). However, this group is extremely heterogeneous, as indicated by large standard deviations in the early and maintenance gains (see Table 2). Note that transition rates for CHR to ROP are about 25% within 3 years, indicating a high heterogeneity of CHR subjects regarding susceptibility to psychosis 52 . In contrast, in the relatively small (n = 11) and heterogeneous sample of recent-onset psychosis probands, our machine only classified two probands (18%) as belonging to the psychosis group. Despite the small sample size, this observation points to possible differences in SPEM performance between recent-onset and chronic states of psychosis (see also Table 1 for information about illness duration) as discussed previously 53 . That study observed subtle impairments of immediate sensorimotor processing in first-episode psychosis patients with only short duration of treatment, e.g. after 6 weeks, which appeared to be compensated by predictive drive to pursuit. In more detail, first-episode patients demonstrated slightly worse performance in the pure-ramp task (comparable to the step-ramp task in the current study) but were unaffected in the oscillating task (comparable to the triangle wave task in the current study). Deficits were discussed as possible medication effects with regard to their serotonergic antagonism of brainstem sensorimotor systems. However, as in the present study, no associations between SPEM variables and medication dosage were found 53 . Indeed, in our ROP group (which might be comparable to the first-episode patients after short duration of treatment from the study by Lencer and colleagues 53 ), early maintenance gain (driven by immediate sensorimotor processing) was considerably reduced while predictive maintenance gain was unaffected (see Table 2). Notably, 76% of probands with recent-onset depression and 94% of healthy controls from the PRONIA sample were correctly classified as not belonging to the psychosis group.
Despite the clear strengths of the study, some limitations need to be discussed: (1) SPEM results for initial eye acceleration and latency differed between laboratories/recording devices (Supplementary Table 11). To estimate the impact of these two variables on the prediction of our machine, we additionally trained a machine in the B-SNIP1 sample using only the two eye velocity gain measures as predictors. The machine was able to distinguish psychosis probands from healthy controls with a balanced accuracy of 61.90% (Supplementary Table 12), which is close to the main result using all SPEM variables (63.96%). However, laboratory conditions and/or recording devices may have an impact on the measurement of SPEM initial eye acceleration and latency that could have affected prediction results. (2) As we trained the machine in a sample of chronically ill psychosis probands, possible effects of medication have to be taken into account. Although we found only small and inconsistent correlations between SPEM and chlorpromazine equivalents, effects of medication cannot be fully ruled out 53 . (3) Furthermore, we found significant differences in cognition scores between psychosis probands and healthy controls in the B-SNIP1 sample. Effects of cognitive skills therefore cannot be entirely ruled out. (4) Despite our comprehensive validation samples, our machine was not validated in a group of MDD with psychosis. (5) There is a discrepancy between sensitivity (53%) and specificity (75%), implying that our model is particularly suited to correctly identifying healthy probands as healthy. (6) No follow-up data of samples from the PRONIA study are available to evaluate transition rates of those CHR participants with poor SPEM performance.
Our comprehensive findings support SPEM as an indicator of sensorimotor transformation impairments relevant to patients suffering from chronic psychosis. Thus, our machine learning algorithm based on the performance in a 5-min SPEM task can help to obtain an overview of sensorimotor transformation profiles on an individual level that might inform treatment decisions in rehabilitation contexts, e.g. regarding sensorimotor remediation strategies.
Future studies should broaden this biomarker approach by combining indicators of sensorimotor function with multiple other relevant neurobiological measures, e.g. brain structure indices, to improve individual prediction accuracies and to inform personalized therapeutic decisions for psychotic disorders. Additionally, future studies should address the question of whether SPEM impairments can indicate illness progression independently of illness duration.

Subjects
SPEM data from five independent samples were included in the following analyses (Fig. 1):

B-SNIP1
First, the machine was trained and internally validated with SPEM data from the B-SNIP1 sample consisting of n = 674 chronically ill psychosis probands (n = 265 schizophrenia, n = 178 schizoaffective, and n = 231 bipolar with psychotic symptoms) and n = 305 healthy controls. Participants were recruited by the B-SNIP consortium across five sites in the US (Baltimore, Boston, Chicago, Dallas, Hartford; Tamminga et al. 27 ). Diagnoses were derived by a consensus of experienced clinicians based on all available clinical information and the Structured Clinical Interview for DSM-IV 54 . Inclusion criteria comprised (1) age between 15 and 65 years; (2) reading score of the Wide Range Achievement Test ≥ 60 55 ; (3) no history of a neurologic disorder; (4) normal or corrected-to-normal vision (minimum of 20/40 acuity); and (5) no history of substance abuse within the last month or substance dependence within the last three months, and negative urine toxicology on study day. Additionally, healthy controls were not allowed to have a personal or family history (first-degree) of psychotic or bipolar disorders, to have a history of recurrent mood disorder or to exhibit a history of psychosis spectrum personality traits 56 . The protocol of the study was approved by institutional review boards at each of the study sites and participants provided written informed consent. For group differences in SPEM performance see Lencer et al. 20 .
Second, the remaining study samples were used (a) as external validation data for the machine trained in the B-SNIP1 sample and (b) for investigating psychosis-related specificity of SPEM against probands with predominantly affective disorders.

External validation-1: B-SNIP2
B-SNIP2 is the follow-up to B-SNIP1. SPEM data were available from n = 727 chronically ill psychosis probands (n = 288 schizophrenia, n = 264 schizoaffective, and n = 175 bipolar with psychotic symptoms) as well as n = 292 healthy controls recruited in Boston, Chicago, Dallas, Hartford, and Athens (GA). Inclusion criteria were  62 , but SPEM data have not been published so far.
The PARDIP study took place in Dallas, Boston, and Hartford. It was nested within the B-SNIP consortium using similar inclusion criteria but, importantly, there was no overlap between PARDIP and B-SNIP participants. For further information on inclusion criteria and group differences in SPEM performance, see Brakemeier et al. 57 .

External validation-4: PRONIA
Following an exploratory approach for testing the validity of our machine developed in stable probands with chronic psychosis, SPEM data were also collected in collaboration with the multisite PRONIA consortium (https://www.pronia.eu/, Koutsouleris et al. 59 ) from n = 11 probands with recent-onset psychosis, n = 19 probands with a high clinical risk for the development of psychosis, n = 17 probands with recent-onset depression, and n = 16 healthy controls at the Münster site. All patients were medicated as prescribed by their doctors except for regular or current sedative medication, which was an exclusion criterion (see chlorpromazine equivalents at time of testing in Table 1). Note that prior to inclusion, ROP patients from the PRONIA sample had not been allowed to take any antipsychotic medication for longer than 90 days (within the past 24 months) with a daily dose rate at or above the minimum dosage of DGPPN S3-guidelines 60 .
Participants gave written informed consent according to the Declaration of Helsinki.Each study was approved by the respective local ethics committee.

Eye movement measurement and task
At all sites, the SPEM target consisted of a small stimulus (0.5°) moving back and forth in the horizontal plane at 18.7°/s constant velocity displayed on a monitor to constitute full-ramp trials within triangle wave tasks and foveo-petal step-ramp trials 61 . Participants were instructed to follow the stimulus with their eyes as accurately as possible while sitting in front of the monitor with their heads stabilized using a chin and forehead restraint. Across all studies, eye movements were recorded in a quiet and darkened room.
For B-SNIP1, B-SNIP2, and PARDIP samples (for further details refer to Brakemeier et al. 57 ; Huang et al. 62 ; Lencer et al. 20 ), participants were seated 60 cm from a 22-inch CRT monitor (1360 × 768 resolution; 150 Hz refresh rate) and eye movements were recorded using an Eyelink II (SR Research Ltd., Ontario/Canada) recording device at 500 Hz sampling rate. The stimulus comprised a red cross in a box covering 0.5° moving horizontally between ± 12° across the screen.
In B-SNIP1 and PARDIP studies, 48 full-ramp and 32 foveo-petal step-ramp trials 61 , both at 18.7° of visual angle per second constant velocity, were applied in order to assess SPEM performance. In full-ramp trials, the stimulus moved back and forth with constant velocity in a triangular waveform. During step-ramp trials, the target started from the central position, stepped either to the right or the left (2.4° of visual angle in a randomized order) and afterwards moved in the opposite direction towards the periphery at 18.7° of visual angle per second constant velocity. The stimulus re-crossed the central line after 133 ms, allowing the initiation of SPEM without a necessary catch-up saccade 61 . Additionally, some trials with 9.7° of visual angle/second and 26.6° of visual angle/second target velocities as well as trials with intervals where the target was blanked were displayed to enhance attention but were not included in the analyses (30% of trials). In order to ensure data quality, additional calibration trials were presented between blocks of trials. SPEM measurement was conducted identically across sites.
For B-SNIP2 a slightly different task order was applied: Here, 48 full-ramp trials at 18.7° of visual angle per second constant velocity and a total of 48 foveo-petal step-ramp trials (32 × 18.7° of visual angle/second; 8 × 9.7° of visual angle/second; 8 × 26.6° of visual angle/second; randomized for direction; Rashbass 61 ) were presented in two test sets (each consisting of 48 trials) including six alternating blocks of either eight full-ramp or step-ramp trials.
Step-ramp trials at 9.7° of visual angle/second and 26.6° of visual angle/second velocities were shown to enhance attention but were not included in the analyses. To ensure data quality, additional calibration trials were displayed between blocks of trials. SPEM measurement was conducted identically across sites.
For FOR2107 and PRONIA, eye movements were recorded using an Eyelink 1000 (SR Research Ltd., Ontario/Canada) recording device at 500 Hz sampling rate. Participants were seated 60 cm from a 22-inch CRT monitor (1360 × 768 resolution; 150 Hz refresh rate). Stimulus and task were identical to B-SNIP2.

Eye movement data processing
All SPEM data were analyzed using identical routines in MatLab (The MathWorks, Natick, MA) developed by one of the authors (AS). Eye position data were filtered using a one-dimensional Gaussian filter (30 Hz) and, subsequently, smoothed eye velocity was computed with central median differentiation of 9 ms 20,57,63 .
To assess the different sensorimotor aspects of SPEM performance, the following variables were computed 20,36,57 (see Fig. 2, adapted from Ref. 57 ): Predictive maintenance gain during continuous pursuit was calculated from triangular wave tasks as the ratio of median eye velocity to target velocity from middle sections (300-840 ms after stimulus direction reversal) over all full-ramp trials (total duration of a ramp is 1200 ms). Predictive maintenance gain highly depends on predictive drive, i.e. cognitive input to the pursuit system for sustained SPEM under closed-loop conditions.
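The gain definition above (median eye velocity divided by target velocity within a fixed analysis window) can be illustrated with a minimal sketch. It is not the authors' MatLab routine: the function name, the data layout, and the choice to take the median of per-trial gains are assumptions made for illustration, and saccade or artifact removal is not covered. The data are synthetic.

```python
from statistics import median

TARGET_VELOCITY = 18.7   # deg/s, target velocity from the task description
WINDOW_MS = (300, 840)   # window after stimulus direction reversal


def maintenance_gain(trials, target_velocity=TARGET_VELOCITY, window_ms=WINDOW_MS):
    """Median gain over trials.

    trials: list of trials, each a list of (time_ms, eye_velocity) pairs,
    with time measured from the stimulus direction reversal.
    Aggregating by the median of per-trial gains is an assumption.
    """
    start, end = window_ms
    per_trial = []
    for trial in trials:
        in_window = [abs(v) for t, v in trial if start <= t <= end]
        if in_window:
            per_trial.append(median(in_window) / target_velocity)
    if not per_trial:
        raise ValueError("no samples inside the analysis window")
    return median(per_trial)


def make_trial(gain):
    """Synthetic 1200 ms ramp sampled at 500 Hz (illustrative data only)."""
    trial = []
    for t in range(0, 1200, 2):
        inside = WINDOW_MS[0] <= t <= WINDOW_MS[1]
        velocity = gain * TARGET_VELOCITY if inside else 0.5 * TARGET_VELOCITY
        trial.append((t, velocity))
    return trial


trials = [make_trial(g) for g in (0.85, 0.90, 0.95)]
print(round(maintenance_gain(trials), 3))
```

Passing `window_ms=(350, 550)` with time measured from stimulus onset would give the analogous early maintenance gain for step-ramp trials.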
In contrast, measures from foveo-petal step-ramp tasks represent rapid sensorimotor transformations using immediate visual motion and early performance feedback.
These included, first, early maintenance gain as the ratio of median eye velocity to target velocity from middle sections (350-550 ms after stimulus onset) over all unpredictable step-ramp trials, thus reflecting early eye velocity under visual feedback control 64 . Typically, early maintenance gain is considerably lower than sustained predictive maintenance gain.
Second, for the computation of initial eye acceleration under open-loop conditions, when visual feedback is not yet available, eye velocity was smoothed using a Savitzky-Golay finite impulse response filter (polynomial order of 3 and a frame length of 63). The onset of eye acceleration was defined as eye velocity exceeding a noise threshold (above 3.2 standard deviations of mean resting eye velocity, which was calculated from 200 ms before to 100 ms after ramp-onset, Carl & Gellman 65 ) for at least 20 ms. Initial eye acceleration was then computed using robust linear regression slope (RobustFit ® in MatLab) in a 100 ms time window starting with the acceleration onset over all trials. Third, eye latency was determined as the time that had elapsed between onset of stimulus movement and onset of eye acceleration 65 over all trials.
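The onset criterion and the slope estimate described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: Savitzky-Golay smoothing is omitted, an ordinary least-squares slope stands in for the robust regression, all function names are hypothetical, and the velocity trace is synthetic.

```python
import math
from statistics import mean, stdev

SAMPLE_MS = 2  # 500 Hz sampling rate


def detect_onset(times_ms, velocity, n_sd=3.2, baseline_ms=(-200, 100), min_duration_ms=20):
    """Index of the first sample that starts a run of at least min_duration_ms
    above mean + n_sd * SD of the baseline velocity; None if no such run."""
    baseline = [v for t, v in zip(times_ms, velocity) if baseline_ms[0] <= t <= baseline_ms[1]]
    threshold = mean(baseline) + n_sd * stdev(baseline)
    needed = min_duration_ms // SAMPLE_MS
    run = 0
    for i, v in enumerate(velocity):
        run = run + 1 if v > threshold else 0
        if run == needed:
            return i - needed + 1
    return None


def ols_slope(x, y):
    """Ordinary least-squares slope (stand-in for a robust regression)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def initial_acceleration(times_ms, velocity, onset_idx, window_ms=100):
    """Slope of eye velocity (deg/s^2) in a window starting at the onset."""
    n = window_ms // SAMPLE_MS
    t_sec = [ms / 1000 for ms in times_ms[onset_idx:onset_idx + n]]
    return ols_slope(t_sec, velocity[onset_idx:onset_idx + n])


# Synthetic trace: small oscillating noise, then a linear velocity ramp.
times_ms = list(range(-200, 402, SAMPLE_MS))  # time relative to ramp onset
true_latency_ms, true_accel = 150, 100.0      # hypothetical ground truth
velocity = [
    0.2 * math.sin(0.9 * i) + max(0.0, (t - true_latency_ms) / 1000 * true_accel)
    for i, t in enumerate(times_ms)
]

onset_idx = detect_onset(times_ms, velocity)
latency_ms = times_ms[onset_idx]
accel = initial_acceleration(times_ms, velocity, onset_idx)
print(latency_ms, round(accel, 1))
```

Because the threshold must be exceeded before an onset is registered, the detected latency lags the true onset by a few samples; this bias grows with the noise level and shrinks with steeper acceleration.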

Psychometric, cognitive, and clinical measures
Psychosis-related symptoms
For B-SNIP1, B-SNIP2, PARDIP, and PRONIA studies, psychosis-related symptoms were rated using the Positive and Negative Syndrome Scale (PANSS) 66 while the FOR2107 study used the Scale for Assessment of Positive Symptoms (SAPS) and the Scale for Assessment of Negative Symptoms (SANS) 67 . To provide comparability, SANS and SAPS scores were converted to PANSS scores 68 , see Supplementary Table 1.

Depression
Depressive symptoms were quantified with the Montgomery-Åsberg Depression Rating Scale (MADRS; Montgomery & Åsberg 69 ) in the B-SNIP1, B-SNIP2 and PARDIP studies and using the original Beck Depression Inventory in the 1978 version 70 in the FOR2107 sample.For PRONIA, the Beck Depression Inventory-II (BDI-II) 71 was applied.Severity gradation (MADRS 72 , BDI 71 ) is given in Supplementary Table 2.

Mania
For B-SNIP1, B-SNIP2, PARDIP and FOR2107 samples, mania was estimated using the Young Mania Rating Scale 73 . Mania was not assessed in the PRONIA sample.

Cognitive abilities
A total score indicating cognitive abilities was estimated using the Wide Range Achievement Test 4 (WRAT4 55 ) in the B-SNIP1, B-SNIP2, and PARDIP samples. For the FOR2107 study, the Multiple-Choice Vocabulary Test, version B (MWT-B 74 ) was used. Scores were converted to the IQ scale 74 . For the PRONIA sample, the Wechsler Adult Intelligence Scale Matrix Reasoning 75 was applied to evaluate cognition.

Machine learning approach
The machine learning model was trained in the B-SNIP1 sample to distinguish psychosis probands from healthy (non-psychotic) controls using PHOTONAI software 76 and scikit-learn toolboxes 77 . A k-fold nested cross-validation procedure was applied to split data used to train the model from data taken for internal validation. Thus, to obtain the most informative model, parameters were optimized using an inner cycle (10 folds) and the best performing model, chosen by highest balanced accuracy ([sensitivity + specificity]/2, taking into account imbalanced data sets), was deployed to an outer cycle (3 folds). Special attention was given to ensure that there was (1) no information leakage between train and validation data 76 and (2) a sufficiently large validation set to provide stable and meaningful results for unseen (external) samples 11 . For specifications of the best model see Supplementary Table 3.
For each of the models the following preprocessing steps were applied: (1) SPEM variables were standardized by scaling. (2) Missing values (predictive maintenance gain = 0%, early maintenance gain = 0.51%, initial eye acceleration = 1.23%, eye latency = 0.51%) were imputed with the median of the corresponding variable. (3) In order to consider different group sizes (674 psychosis probands and 305 healthy controls), data were balanced by either randomly undersampling the majority class or oversampling the minority class using SMOTE 78 . (4) Principal component analysis was applied to reduce the dimensional space.
Predictors included the four SPEM variables described above (i.e. predictive maintenance gain, early maintenance gain, initial eye acceleration, and eye latency). Then, multiple classifiers with default parameters were used to optimize representation of the underlying data (support vector machine, random forest, Gaussian naïve Bayes, logistic regression, AdaBoost) and to discriminate the label group membership (i.e. psychosis proband or healthy control). Additionally, for the support vector machine, kernel (linear, rbf) and regularization (C = [0.1, 0.3, 0.5, 0.7, 0.9, 1]) parameters were optimized.
Statistical inference was examined using permutation tests 79 . To this end, true results were compared to a permutation distribution created from 1000 random rearrangements of the two group labels (healthy controls vs. psychosis group) with respect to the predictors.
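A nested cross-validation of this kind can be sketched with plain scikit-learn. The sketch below is not the PHOTONAI pipeline used in the study: it uses synthetic data in place of the four SPEM predictors, covers only the support vector machine branch with the kernel and C grid named above, keeps all principal components (an assumption, since the number retained is not stated here), and omits the class-balancing step because the synthetic groups are equal in size.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for four predictors in two groups (NOT study data).
n_per_group = 150
controls = rng.normal(loc=0.0, scale=1.0, size=(n_per_group, 4))
probands = rng.normal(loc=-1.0, scale=1.0, size=(n_per_group, 4))
X = np.vstack([controls, probands])
y = np.array([0] * n_per_group + [1] * n_per_group)
# Inject a few missing values so that the imputation step has work to do.
X[rng.integers(0, len(X), size=5), rng.integers(0, 4, size=5)] = np.nan

# Preprocessing lives inside the pipeline so that it is refit on each
# training fold only, which avoids leakage into the validation folds.
pipeline = Pipeline([
    ("scale", StandardScaler()),                   # ignores NaN when fitting
    ("impute", SimpleImputer(strategy="median")),
    ("pca", PCA()),
    ("svm", SVC()),
])
param_grid = {
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 0.3, 0.5, 0.7, 0.9, 1],
}

inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Inner loop selects hyperparameters; outer loop estimates performance.
search = GridSearchCV(pipeline, param_grid, scoring="balanced_accuracy", cv=inner_cv)
outer_scores = cross_val_score(search, X, y, scoring="balanced_accuracy", cv=outer_cv)
print(outer_scores, outer_scores.mean())
```

Wrapping the grid search inside `cross_val_score` is what makes the procedure nested: hyperparameters are never tuned on the data used to report the outer-fold balanced accuracy.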
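The logic of a label-permutation test can be sketched in a few lines. This is a simplified illustration with hypothetical labels and predictions: it shuffles the group labels against fixed predictions, whereas a full analysis would typically refit the model for every permutation. The function names are assumptions.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2 for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (tp / n_pos + tn / n_neg) / 2


def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """Share of label permutations scoring at least as well as the true labels."""
    rng = random.Random(seed)
    observed = balanced_accuracy(y_true, y_pred)
    shuffled = list(y_true)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if balanced_accuracy(shuffled, y_pred) >= observed:
            at_least_as_good += 1
    # Adding 1 to numerator and denominator keeps the p-value above zero.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)


# Hypothetical example: 60 probands (1) and 40 controls (0).
y_true = [1] * 60 + [0] * 40
y_pred = [1] * 42 + [0] * 18 + [0] * 30 + [1] * 10

observed, p_value = permutation_p_value(y_true, y_pred)
print(observed, p_value)
```

With 1000 permutations the smallest attainable p-value under this estimator is 1/1001, which is why results of such tests are commonly reported as p < 0.001.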
Additionally, we trained machine learning algorithms to separate psychosis probands in the B-SNIP1 sample.In line with the idea of SPEM deterioration across the whole psychosis spectrum, results for distinguishing individual proband groups are close to chance level (balanced accuracies: schizophrenia vs. schizoaffective probands 52.65%, schizophrenia vs. bipolar probands 52.48%, schizoaffective vs. bipolar probands 51.00%, Supplementary Table 5).
External validation of the model was investigated by applying the best performing model from B-SNIP1 to B-SNIP2 (external validation-1), PARDIP (external validation-2), FOR2107 (external validation-3), and PRONIA (external validation-4) samples. Here, in accordance with the idea that there is a specific relationship between SPEM performance and psychosis syndromes, we also applied the model to other non-psychotic psychiatric patient groups, expecting them not to be classified as psychosis probands (and thus to be closer to the healthy non-psychotic control group). Kendall's Tau correlation coefficients were computed between SPEM measures and chlorpromazine equivalents 80 . Additionally, correlations were calculated between SPEM measures and WRAT4 scores as well as z-scores of the Brief Assessment of Cognition in Schizophrenia (BACS; Keefe et al. 81 ). Analyses were computed in the B-SNIP1 sample. Results are reported using a Bonferroni-Holm-corrected alpha level adjusted for each of the studies over all four SPEM variables 82,83 .
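The Bonferroni-Holm step-down procedure over four tests can be sketched as follows. The p-values are hypothetical and serve only to show the mechanics; the function name is an assumption, and the sketch returns adjusted p-values, which is equivalent to comparing raw p-values against step-wise adjusted alpha levels.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for the four SPEM variables (illustration only).
p_raw = {
    "predictive maintenance gain": 0.010,
    "early maintenance gain": 0.040,
    "initial eye acceleration": 0.030,
    "eye latency": 0.005,
}
p_adj = dict(zip(p_raw, holm_adjust(list(p_raw.values()))))
for name, value in p_adj.items():
    print(f"{name}: {value:.3f}")
```

A correlation would then be considered significant when its adjusted p-value falls below the nominal alpha level, e.g. 0.05.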

Figure 2. Examples of pursuit stimuli with pursuit recordings (eye position and eye velocity) in a control subject and a psychosis proband. Foveopetal step-ramp tasks (A) are used to measure saccade-free pursuit initiation. Variables of interest are pursuit latency (time between target step and green dot), initial eye acceleration (blue line) and early maintenance gain (blue line in grey shaded intervals). Triangular wave tasks (B) are used to measure sustained predictive maintenance gain in predefined intervals (blue line in grey shaded intervals) excluding artifacts induced by target reversals. The figure has been adapted from one of our prior publications by Brakemeier and colleagues 57.
To examine the effect of sample size on model performance, additional models were trained and internally validated in a randomly selected half of the B-SNIP1 sample and in the combined B-SNIP1 and B-SNIP2 samples.

Table 1. Descriptive information and clinical characteristics for proband groups by study. CPZ equivalents Chlorpromazine equivalents, B-SNIP Bipolar-Schizophrenia Network on Intermediate Phenotypes, SZ probands with schizophrenia, SAD probands with schizoaffective disorder, BP probands with bipolar disorder, PARDIP Psychosis and Affective Research Domains and Intermediate Phenotypes, BPwP bipolar probands with psychosis, BPwoP bipolar probands without psychosis, FOR2107 DFG Forschergruppe 2107, MDwoP probands with major depression without psychosis, PRONIA Personalised Prognostic Tools for Early Psychosis Management, ROD recent-onset depression probands, CHR clinical-high-risk-for-psychosis probands, ROP recent-onset-psychosis probands. Values are means with standard deviations given in parentheses.
a The group of psychosis probands includes SZ, SAD, and BP probands.
b The group of psychosis probands includes n = 27 SZ, n = 18 SAD, n = 2 Brief psychotic disorder, n = 1 Delusional disorder, n = 2 BPwP, n = 1 MDwP; due to small samples, no specific results for psychosis subgroups are given.
c Total scores indicating cognitive abilities were estimated using the following measures: B-SNIP1, B-SNIP2, PARDIP = Wide Range Achievement Test 4 (WRAT4 55); FOR2107 = Multiple Choice Vocabulary Test, version B (MWT-B 74), original MWT-B scores were transformed to the IQ scale 74; PRONIA = Wechsler Adult Intelligence Scale Matrix Reasoning 75.
d Psychosis features were estimated using the following measures: B-SNIP1, B-SNIP2, PARDIP, and PRONIA = Positive And Negative Syndrome Scale (PANSS 66); FOR2107 = Scale for Assessment of Positive Symptoms (SAPS) and Scale for Assessment of Negative Symptoms (SANS) 67, SAPS and SANS global/summary scores were converted to PANSS scores 68.
e Depressive symptoms were estimated using the following measures: B-SNIP1, B-SNIP2, PARDIP = Montgomery-Åsberg Depression Rating Scale (MADRS 69); FOR2107 = Original Beck Depression Inventory, 1978 version 70; PRONIA = Beck Depression Inventory-II (BDI-II 71).

Table 2. Descriptive results of smooth pursuit eye movements for proband groups by study.

Table 3. Prediction accuracies for all samples and model results for the comparison of chronic psychosis probands vs. controls. B-SNIP bipolar-schizophrenia network on intermediate phenotypes, PARDIP psychosis and affective research domains and intermediate phenotypes, BPwP bipolar probands with psychosis, BPwoP bipolar probands without psychosis, FOR2107 DFG Forschergruppe 2107, MDwoP probands with major depression without psychosis, PRONIA personalised prognostic tools for early psychosis management, ROD recent-onset depression probands, CHR clinical-high-risk-for-psychosis probands, ROP recent-onset-psychosis probands, BAC balanced accuracy score.
a p-value < 0.001, indicated for BAC score as this metric was used to identify the best performing model.
b BAC, sensitivity, and specificity can only be computed for comparisons including psychosis patients and controls (as these groups from the B-SNIP1 sample were used in the machine training and internal validation processes). For all other samples, true label and predicted label (psychosis proband or healthy control) are given. Main results are displayed in bold.

probands and healthy controls (external validation-1: B-SNIP2), in a sample of bipolar probands with and without psychotic symptoms (external validation-2: PARDIP), in a sample of probands with affective disorders without psychotic symptoms and psychosis probands (external validation-3: FOR2107), and in a sample with recent-onset psychosis or depression and clinical high risk of psychosis (external validation-4: PRONIA). Our main finding shows high consistency for the identification of psychosis probands vs. healthy controls by these sensorimotor indicators throughout the four different samples (B-SNIP1: 63.96%, B-SNIP2: 65.03%, PARDIP: 65.52%, FOR2107: 58.37%). However, it is important to consider that our model performed notably better in accurately classifying controls as controls (specificities in the different samples ranged from 63 to 75%) than psychosis probands as psychosis probands (sensitivities ranged from 43 to 68%).
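The balanced accuracy combines sensitivity and specificity with equal weight, which is why a model with modest sensitivity can still reach the accuracies reported here. As a worked check using the standard definition and the B-SNIP1 values reported in the Results (sensitivity = 52.97%, specificity = 74.96%):

```latex
\mathrm{BAC} = \frac{\text{sensitivity} + \text{specificity}}{2}
             = \frac{52.97\% + 74.96\%}{2}
             \approx 63.97\%
```

This agrees with the reported 63.96% up to rounding of the two inputs.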